Ralph Barnes

On my honor, as a student, I have neither given nor received unauthorized aid on this academic work. The assignment I am submitting is my own containing my own codes and words.

In [1]:
# import packages
import requests as rq
import pandas as pd
import numpy as np
import random as rd
from lxml import html
import csv

# for plotting
import matplotlib as plt
%matplotlib inline
import plotly
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from plotly.graph_objs import *
init_notebook_mode()
import seaborn as sns

Data Cleaning Tutorial

Data Analysis cannot be successful if the data to be analyzed is incomplete. Performing statistical calculations on age, for example, could be disasterous if someone is mislabeld as being 350 years old!

Data cleaning is 90% of data analytics. Bad data = bad analytics!

Assistance with this tutorial came from DataCamp (multiple examples).

https://campus.datacamp.com/courses/cleaning-data-in-python/exploring-your-data

Other assistance from

Pandas Pydata https://pandas.pydata.org/pandas-docs/stable/missing_data.html

GitHub https://github.com/KarrieK/pandas_data_cleaning

Reading in the dataset.

At this point, you should know some basic Python commands. For example, let's read in our test data file for this tutorial.

In [2]:
# Import csv file into dataframe (df) for data cleaning 
# data from NYC Open Data (https://data.cityofnewyork.us/Housing-Development/DOB-Job-Application-Filings/ic3t-wcy2)
df = pd.read_csv('data/dob_job_application_filings.csv', sep=',', low_memory=False)

Memory Issue?

Note - I found that in reading this dataset, several columns contained mixed data and caused an issue.

Adding "low_memory=False" to the pd.read_csv function to avoid this issue.

We can also make a copy of the data prior to cleaning it, to ensure we maintain original data.

This is a new function we did not previously learn.

In [3]:
# copy dataset
df2 = df.copy()
In [4]:
# To see and display top 5 rows
df2.head()
Out[4]:
Job # Doc # Borough House # Street Name Block Lot Bin # Job Type Job Status ... Owner's Last Name Owner's Business Name Owner's House Number Owner'sHouse Street Name City State Zip Owner'sPhone # Job Description DOBRunDate
0 121577873 2 MANHATTAN 386 PARK AVENUE SOUTH 857 38.0 1016890.0 A2 D ... MIGLIORE MACKLOWE MANAGEMENT 126 EAST 56TH STREET NEW YORK NY 10222 2125545837 GENERAL MECHANICAL & PLUMBING MODIFICATIONS AS... 04/26/2013 12:00:00 AM
1 520129502 1 STATEN ISLAND 107 KNOX PLACE 342 1.0 5161350.0 A3 A ... BLUMENBERG NA 107 KNOX PLACE STATEN ISLAND NY 10314 3477398892 BUILDERS PAVEMENT PLAN 143 LF. ... 04/26/2013 12:00:00 AM
2 121601560 1 MANHATTAN 63 WEST 131 STREET 1729 9.0 1053831.0 A2 Q ... MARKOWITZ 635 RIVERSIDE DRIVE NY LLC 619 WEST 54TH STREET NEW YORK NY 10016 2127652555 GENERAL CONSTRUCTION TO INCLUDE NEW PARTITIONS... 04/26/2013 12:00:00 AM
3 121601203 1 MANHATTAN 48 WEST 25TH STREET 826 69.0 1015610.0 A2 D ... CASALE 48 W 25 ST LLC C/O BERNSTEIN 150 WEST 30TH STREET NEW YORK NY 10001 2125941414 STRUCTURAL CHANGES ON THE 5TH FLOOR (MOONDOG E... 04/26/2013 12:00:00 AM
4 121601338 1 MANHATTAN 45 WEST 29 STREET 831 7.0 1015754.0 A3 D ... LEE HYUNG-HYANG REALTY CORP 614 8 AVENUE NEW YORK NY 10001 2019881222 FILING HEREWITH FACADE REPAIR PLANS. WORK SCOP... 04/26/2013 12:00:00 AM

5 rows × 82 columns

This file has some basic information: Job #, House #, etc.

In [5]:
# See the end (tail) of the data, or last 5 rows
df2.tail()
Out[5]:
Job # Doc # Borough House # Street Name Block Lot Bin # Job Type Job Status ... Owner's Last Name Owner's Business Name Owner's House Number Owner'sHouse Street Name City State Zip Owner'sPhone # Job Description DOBRunDate
322525 140713958 1 MANHATTAN 345 5TH AVENUE 863 4.0 1017075.0 A2 H ... OHEBSHALOM REGENCY ENTERPRISES PARTNERS LLC 347 FIFTH AVENUE NEW YORK NY 10016 2126865252 FILE FOR MECHANICAL AND SPRINKLER WORK AS PER ... 11/06/2017 12:00:00 AM
322526 140713976 1 MANHATTAN 517 COLUMBUS AVENUE 1198 7502.0 1031339.0 A2 H ... KONTOGIANNIS EGS, LLC 517 COLUMBUS AVENUE NEW YORK NY 10024 9177692434 INTERIOR ALTERATION OF EXISTING EATING AND DRI... 11/06/2017 12:00:00 AM
322527 340560164 1 BROOKLYN 1559 BUSHWICK AVE 3471 9.0 3080494.0 A2 D ... MELENDEZ NaN 74 GRANT AVE QUEENS NY 11208 7188097151 TO INSTALL SPRINKLERS IN THE NEW RENOVATED BUI... 11/06/2017 12:00:00 AM
322528 140713949 1 MANHATTAN 345 5TH AVE 863 4.0 1017075.0 A2 H ... OHEBSHALOM REGENCY ENTERPRISES PARTNERS LLC 347 FIFTH AVENUE NEW YORK NY 10016 2126865252 FILE STRUCTURAL WORK TYPE AS PER PLAN. NO CHAN... 11/06/2017 12:00:00 AM
322529 540138331 1 STATEN ISLAND 62 MILBANK ROAD 4092 72.0 5057106.0 A2 H ... GRASSO NONE 62 MILBANK ROAD STATEN ISLAND NY 10306 3476238868 PROPOSED SPRINKLER SYSTEM IS DESIGNED IN ACCOR... 11/06/2017 12:00:00 AM

5 rows × 82 columns

When using functions there are the "dot" notation or bracket notation.

Dot notation is using the column name with a leading period - df.house

Bracket notation is using the column name inside brackets - df['house#']

We can use .shape and .columns to see the data. Since these are not methods, there is no need to use parenthesis.

In [6]:
# See the shape of the data.
df2.shape
Out[6]:
(322530, 82)

So there are 322,530 rows and 82 columns in this dataset.

In [7]:
# See the columns
df2.columns
Out[7]:
Index([u'Job #', u'Doc #', u'Borough', u'House #', u'Street Name', u'Block',
       u'Lot', u'Bin #', u'Job Type', u'Job Status', u'Job Status Descrp',
       u'Latest Action Date', u'Building Type', u'Community - Board',
       u'Cluster', u'Landmarked', u'Adult Estab', u'Loft Board', u'City Owned',
       u'Little e', u'PC Filed', u'eFiling Filed', u'Plumbing', u'Mechanical',
       u'Boiler', u'Fuel Burning', u'Fuel Storage', u'Standpipe', u'Sprinkler',
       u'Fire Alarm', u'Equipment', u'Fire Suppression', u'Curb Cut', u'Other',
       u'Other Description', u'Applicant's First Name',
       u'Applicant's Last Name', u'Applicant Professional Title',
       u'Applicant License #', u'Professional Cert', u'Pre- Filing Date',
       u'Paid', u'Fully Paid', u'Assigned', u'Approved', u'Fully Permitted',
       u'Initial Cost', u'Total Est. Fee', u'Fee Status',
       u'Existing Zoning Sqft', u'Proposed Zoning Sqft', u'Horizontal Enlrgmt',
       u'Vertical Enlrgmt', u'Enlargement SQ Footage', u'Street Frontage',
       u'ExistingNo. of Stories', u'Proposed No. of Stories',
       u'Existing Height', u'Proposed Height', u'Existing Dwelling Units',
       u'Proposed Dwelling Units', u'Existing Occupancy',
       u'Proposed Occupancy', u'Site Fill', u'Zoning Dist1', u'Zoning Dist2',
       u'Zoning Dist3', u'Special District 1', u'Special District 2',
       u'Owner Type', u'Non-Profit', u'Owner's First Name',
       u'Owner's Last Name', u'Owner's Business Name', u'Owner's House Number',
       u'Owner'sHouse Street Name', u'City ', u'State', u'Zip',
       u'Owner'sPhone #', u'Job Description', u'DOBRunDate'],
      dtype='object')

That is a lot of data in 82 columns!

However, we have found that there is a mixture of data here, so we need to find their types.

Float64 are numerical with decimal notation.

Int64 are numerical without decimal notation.

Obj are string or categorical columns.

To find out the data types, we need to use the info() function.

In [8]:
# Find out data types
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 322530 entries, 0 to 322529
Data columns (total 82 columns):
Job #                           322530 non-null int64
Doc #                           322530 non-null int64
Borough                         322530 non-null object
House #                         322530 non-null object
Street Name                     322530 non-null object
Block                           322530 non-null int64
Lot                             322529 non-null float64
Bin #                           322529 non-null float64
Job Type                        322529 non-null object
Job Status                      322529 non-null object
Job Status Descrp               322529 non-null object
Latest Action Date              322529 non-null object
Building Type                   322529 non-null object
Community - Board               322344 non-null object
Cluster                         1 non-null object
Landmarked                      44871 non-null object
Adult Estab                     19 non-null object
Loft Board                      1377 non-null object
City Owned                      39175 non-null object
Little e                        2987 non-null object
PC Filed                        0 non-null float64
eFiling Filed                   319803 non-null object
Plumbing                        144866 non-null object
Mechanical                      111080 non-null object
Boiler                          73565 non-null object
Fuel Burning                    69952 non-null object
Fuel Storage                    69391 non-null object
Standpipe                       70600 non-null object
Sprinkler                       87278 non-null object
Fire Alarm                      82015 non-null object
Equipment                       120057 non-null object
Fire Suppression                72519 non-null object
Curb Cut                        72176 non-null object
Other                           209441 non-null object
Other Description               209438 non-null object
Applicant's First Name          322528 non-null object
Applicant's Last Name           322528 non-null object
Applicant Professional Title    322528 non-null object
Applicant License #             321087 non-null object
Professional Cert               182363 non-null object
Pre- Filing Date                322528 non-null object
Paid                            308230 non-null object
Fully Paid                      308321 non-null object
Assigned                        118180 non-null object
Approved                        104601 non-null object
Fully Permitted                 24495 non-null object
Initial Cost                    322528 non-null object
Total Est. Fee                  322528 non-null object
Fee Status                      322528 non-null object
Existing Zoning Sqft            322528 non-null float64
Proposed Zoning Sqft            322528 non-null float64
Horizontal Enlrgmt              5979 non-null object
Vertical Enlrgmt                4204 non-null object
Enlargement SQ Footage          322528 non-null float64
Street Frontage                 322528 non-null float64
ExistingNo. of Stories          322528 non-null float64
Proposed No. of Stories         322528 non-null float64
Existing Height                 322528 non-null float64
Proposed Height                 322528 non-null float64
Existing Dwelling Units         208857 non-null object
Proposed Dwelling Units         210476 non-null object
Existing Occupancy              274832 non-null object
Proposed Occupancy              278555 non-null object
Site Fill                       216818 non-null object
Zoning Dist1                    281676 non-null object
Zoning Dist2                    40267 non-null object
Zoning Dist3                    2371 non-null object
Special District 1              70655 non-null object
Special District 2              21185 non-null object
Owner Type                      0 non-null float64
Non-Profit                      18253 non-null object
Owner's First Name              322520 non-null object
Owner's Last Name               322520 non-null object
Owner's Business Name           281466 non-null object
Owner's House Number            322520 non-null object
Owner'sHouse Street Name        322512 non-null object
City                            322528 non-null object
State                           322527 non-null object
Zip                             322527 non-null object
Owner'sPhone #                  322526 non-null object
Job Description                 318959 non-null object
DOBRunDate                      322511 non-null object
dtypes: float64(12), int64(3), object(67)
memory usage: 201.8+ MB

We can easily determine that we have missing values by taking each columns total number of non-null objects and subtracting that from 322,530 columns.

The difference is the missing data in that column.

Handing missing or bad data.

Frequencies greater than the anticipated sum is also an indicator of missing or bad data.

We will use the "value_counts" function. Note the use of dropna = False to also count the null values, if any exist.

In [9]:
# Find frequencies of specific landmarked column data using dot notation. Because of the amont of data, we can shorten
# our search to the first 5 rows by adding .head() at the end. This is a way to slice multiple methods into one statement.
# df['Zip'].value_counts(dropna=False) would be the bracket notation and provide the same results.
df2.Landmarked.value_counts(dropna=False).head()
Out[9]:
NaN    277659
Y       44870
NY          1
Name: Landmarked, dtype: int64
In [10]:
# we can also search the last 5 rows by using tail()
df2.Landmarked.value_counts(dropna=False).tail()
Out[10]:
NaN    277659
Y       44870
NY          1
Name: Landmarked, dtype: int64
In [11]:
# If a column name has a space, it is best to use the bracket notation.
df2['Fully Paid'].value_counts(dropna=False).head()
Out[11]:
NaN           14209
12/30/2014     1248
09/30/2016      806
02/07/2017      784
12/10/2015      732
Name: Fully Paid, dtype: int64

What's this NaN? Ah, we seem to have found some null values.

Another method for determining missing data is by use of the describe() function.

This is one of the statistical summary methods used on numerical data.

In [12]:
# General summary of data
df2.describe()
Out[12]:
Job # Doc # Block Lot Bin # PC Filed Existing Zoning Sqft Proposed Zoning Sqft Enlargement SQ Footage Street Frontage ExistingNo. of Stories Proposed No. of Stories Existing Height Proposed Height Owner Type
count 3.225300e+05 322530.000000 322530.000000 322529.000000 3.225290e+05 0.0 3.225280e+05 3.225280e+05 322528.000000 322528.000000 322528.000000 322528.000000 322528.000000 322528.000000 0.0
mean 2.549742e+08 1.174849 2847.363300 614.990339 2.410567e+06 NaN 1.883839e+03 2.487988e+03 65.107510 4.078306 7.974529 8.059768 93.805018 94.951241 NaN
std 1.327790e+08 0.558310 3208.401239 1986.519440 1.397481e+06 NaN 6.345322e+04 7.270663e+04 2349.114044 139.615311 11.907439 11.901386 167.265840 184.945273 NaN
min 1.024168e+08 1.000000 0.000000 0.000000 1.000003e+06 NaN 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN
25% 1.226378e+08 1.000000 850.000000 12.000000 1.037573e+06 NaN 0.000000e+00 0.000000e+00 0.000000 0.000000 1.000000 2.000000 20.000000 21.000000 NaN
50% 2.205800e+08 1.000000 1504.000000 32.000000 2.083322e+06 NaN 0.000000e+00 0.000000e+00 0.000000 0.000000 4.000000 4.000000 42.000000 45.000000 NaN
75% 3.403683e+08 1.000000 3678.000000 59.000000 3.397216e+06 NaN 0.000000e+00 0.000000e+00 0.000000 0.000000 8.000000 8.000000 96.000000 98.000000 NaN
max 5.401383e+08 13.000000 99999.000000 9999.000000 5.866320e+06 NaN 1.779104e+07 2.864286e+07 765555.000000 50000.000000 235.000000 235.000000 22019.000000 44101.000000 NaN

As you see, this provides the count, mean, standard deviation and other statistical summaries from the data.

Uh oh, there's more of those NaN values in the numerical columns now. So, let's look at the PC Filed column.

In [13]:
# View data in PC Filed column to find missing values
df2['PC Filed'].describe()
Out[13]:
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: PC Filed, dtype: float64

Since this is obviously a flawed column, we can simply drop this column using drop()

In [14]:
# using df3 variable so as to continue work with df2 dataset
df3 = df2.copy()
df3 = df3.drop(['PC Filed'], axis=1)
df3.head()
Out[14]:
Job # Doc # Borough House # Street Name Block Lot Bin # Job Type Job Status ... Owner's Last Name Owner's Business Name Owner's House Number Owner'sHouse Street Name City State Zip Owner'sPhone # Job Description DOBRunDate
0 121577873 2 MANHATTAN 386 PARK AVENUE SOUTH 857 38.0 1016890.0 A2 D ... MIGLIORE MACKLOWE MANAGEMENT 126 EAST 56TH STREET NEW YORK NY 10222 2125545837 GENERAL MECHANICAL & PLUMBING MODIFICATIONS AS... 04/26/2013 12:00:00 AM
1 520129502 1 STATEN ISLAND 107 KNOX PLACE 342 1.0 5161350.0 A3 A ... BLUMENBERG NA 107 KNOX PLACE STATEN ISLAND NY 10314 3477398892 BUILDERS PAVEMENT PLAN 143 LF. ... 04/26/2013 12:00:00 AM
2 121601560 1 MANHATTAN 63 WEST 131 STREET 1729 9.0 1053831.0 A2 Q ... MARKOWITZ 635 RIVERSIDE DRIVE NY LLC 619 WEST 54TH STREET NEW YORK NY 10016 2127652555 GENERAL CONSTRUCTION TO INCLUDE NEW PARTITIONS... 04/26/2013 12:00:00 AM
3 121601203 1 MANHATTAN 48 WEST 25TH STREET 826 69.0 1015610.0 A2 D ... CASALE 48 W 25 ST LLC C/O BERNSTEIN 150 WEST 30TH STREET NEW YORK NY 10001 2125941414 STRUCTURAL CHANGES ON THE 5TH FLOOR (MOONDOG E... 04/26/2013 12:00:00 AM
4 121601338 1 MANHATTAN 45 WEST 29 STREET 831 7.0 1015754.0 A3 D ... LEE HYUNG-HYANG REALTY CORP 614 8 AVENUE NEW YORK NY 10001 2019881222 FILING HEREWITH FACADE REPAIR PLANS. WORK SCOP... 04/26/2013 12:00:00 AM

5 rows × 81 columns

In [15]:
# see info of last 5 rows.
df3.tail()
Out[15]:
Job # Doc # Borough House # Street Name Block Lot Bin # Job Type Job Status ... Owner's Last Name Owner's Business Name Owner's House Number Owner'sHouse Street Name City State Zip Owner'sPhone # Job Description DOBRunDate
322525 140713958 1 MANHATTAN 345 5TH AVENUE 863 4.0 1017075.0 A2 H ... OHEBSHALOM REGENCY ENTERPRISES PARTNERS LLC 347 FIFTH AVENUE NEW YORK NY 10016 2126865252 FILE FOR MECHANICAL AND SPRINKLER WORK AS PER ... 11/06/2017 12:00:00 AM
322526 140713976 1 MANHATTAN 517 COLUMBUS AVENUE 1198 7502.0 1031339.0 A2 H ... KONTOGIANNIS EGS, LLC 517 COLUMBUS AVENUE NEW YORK NY 10024 9177692434 INTERIOR ALTERATION OF EXISTING EATING AND DRI... 11/06/2017 12:00:00 AM
322527 340560164 1 BROOKLYN 1559 BUSHWICK AVE 3471 9.0 3080494.0 A2 D ... MELENDEZ NaN 74 GRANT AVE QUEENS NY 11208 7188097151 TO INSTALL SPRINKLERS IN THE NEW RENOVATED BUI... 11/06/2017 12:00:00 AM
322528 140713949 1 MANHATTAN 345 5TH AVE 863 4.0 1017075.0 A2 H ... OHEBSHALOM REGENCY ENTERPRISES PARTNERS LLC 347 FIFTH AVENUE NEW YORK NY 10016 2126865252 FILE STRUCTURAL WORK TYPE AS PER PLAN. NO CHAN... 11/06/2017 12:00:00 AM
322529 540138331 1 STATEN ISLAND 62 MILBANK ROAD 4092 72.0 5057106.0 A2 H ... GRASSO NONE 62 MILBANK ROAD STATEN ISLAND NY 10306 3476238868 PROPOSED SPRINKLER SYSTEM IS DESIGNED IN ACCOR... 11/06/2017 12:00:00 AM

5 rows × 81 columns

As we can see, there is still NaN or null values within the data.

We can do several things with the missing data. We have already used dropna(), but there are more ways to fill in the data for a particular cell.

Let's use df# so as to retain the original df2 dataset, but show the variations of cleaning data.

Other methods of handling bad data.

In [16]:
# using ffill() function to forward fill data with zero
df4 = df2.copy()
df4 = df4.ffill(0)
df4.head()
Out[16]:
Job # Doc # Borough House # Street Name Block Lot Bin # Job Type Job Status ... Owner's Last Name Owner's Business Name Owner's House Number Owner'sHouse Street Name City State Zip Owner'sPhone # Job Description DOBRunDate
0 121577873 2 MANHATTAN 386 PARK AVENUE SOUTH 857 38.0 1016890.0 A2 D ... MIGLIORE MACKLOWE MANAGEMENT 126 EAST 56TH STREET NEW YORK NY 10222 2125545837 GENERAL MECHANICAL & PLUMBING MODIFICATIONS AS... 04/26/2013 12:00:00 AM
1 520129502 1 STATEN ISLAND 107 KNOX PLACE 342 1.0 5161350.0 A3 A ... BLUMENBERG NA 107 KNOX PLACE STATEN ISLAND NY 10314 3477398892 BUILDERS PAVEMENT PLAN 143 LF. ... 04/26/2013 12:00:00 AM
2 121601560 1 MANHATTAN 63 WEST 131 STREET 1729 9.0 1053831.0 A2 Q ... MARKOWITZ 635 RIVERSIDE DRIVE NY LLC 619 WEST 54TH STREET NEW YORK NY 10016 2127652555 GENERAL CONSTRUCTION TO INCLUDE NEW PARTITIONS... 04/26/2013 12:00:00 AM
3 121601203 1 MANHATTAN 48 WEST 25TH STREET 826 69.0 1015610.0 A2 D ... CASALE 48 W 25 ST LLC C/O BERNSTEIN 150 WEST 30TH STREET NEW YORK NY 10001 2125941414 STRUCTURAL CHANGES ON THE 5TH FLOOR (MOONDOG E... 04/26/2013 12:00:00 AM
4 121601338 1 MANHATTAN 45 WEST 29 STREET 831 7.0 1015754.0 A3 D ... LEE HYUNG-HYANG REALTY CORP 614 8 AVENUE NEW YORK NY 10001 2019881222 FILING HEREWITH FACADE REPAIR PLANS. WORK SCOP... 04/26/2013 12:00:00 AM

5 rows × 82 columns

In [17]:
# using bfill() function to back fill data with the value of 1
df5 = df2.copy()
df5 = df5.bfill(1)
df5.head()
Out[17]:
Job # Doc # Borough House # Street Name Block Lot Bin # Job Type Job Status ... Owner's Last Name Owner's Business Name Owner's House Number Owner'sHouse Street Name City State Zip Owner'sPhone # Job Description DOBRunDate
0 121577873 2 MANHATTAN 386 PARK AVENUE SOUTH 857 38 1.01689e+06 A2 D ... MIGLIORE MACKLOWE MANAGEMENT 126 EAST 56TH STREET NEW YORK NY 10222 2125545837 GENERAL MECHANICAL & PLUMBING MODIFICATIONS AS... 04/26/2013 12:00:00 AM
1 520129502 1 STATEN ISLAND 107 KNOX PLACE 342 1 5.16135e+06 A3 A ... BLUMENBERG NA 107 KNOX PLACE STATEN ISLAND NY 10314 3477398892 BUILDERS PAVEMENT PLAN 143 LF. ... 04/26/2013 12:00:00 AM
2 121601560 1 MANHATTAN 63 WEST 131 STREET 1729 9 1.05383e+06 A2 Q ... MARKOWITZ 635 RIVERSIDE DRIVE NY LLC 619 WEST 54TH STREET NEW YORK NY 10016 2127652555 GENERAL CONSTRUCTION TO INCLUDE NEW PARTITIONS... 04/26/2013 12:00:00 AM
3 121601203 1 MANHATTAN 48 WEST 25TH STREET 826 69 1.01561e+06 A2 D ... CASALE 48 W 25 ST LLC C/O BERNSTEIN 150 WEST 30TH STREET NEW YORK NY 10001 2125941414 STRUCTURAL CHANGES ON THE 5TH FLOOR (MOONDOG E... 04/26/2013 12:00:00 AM
4 121601338 1 MANHATTAN 45 WEST 29 STREET 831 7 1.01575e+06 A3 D ... LEE HYUNG-HYANG REALTY CORP 614 8 AVENUE NEW YORK NY 10001 2019881222 FILING HEREWITH FACADE REPAIR PLANS. WORK SCOP... 04/26/2013 12:00:00 AM

5 rows × 82 columns

In [18]:
# display the last 5 rows
df5.tail()
Out[18]:
Job # Doc # Borough House # Street Name Block Lot Bin # Job Type Job Status ... Owner's Last Name Owner's Business Name Owner's House Number Owner'sHouse Street Name City State Zip Owner'sPhone # Job Description DOBRunDate
322525 140713958 1 MANHATTAN 345 5TH AVENUE 863 4 1.01708e+06 A2 H ... OHEBSHALOM REGENCY ENTERPRISES PARTNERS LLC 347 FIFTH AVENUE NEW YORK NY 10016 2126865252 FILE FOR MECHANICAL AND SPRINKLER WORK AS PER ... 11/06/2017 12:00:00 AM
322526 140713976 1 MANHATTAN 517 COLUMBUS AVENUE 1198 7502 1.03134e+06 A2 H ... KONTOGIANNIS EGS, LLC 517 COLUMBUS AVENUE NEW YORK NY 10024 9177692434 INTERIOR ALTERATION OF EXISTING EATING AND DRI... 11/06/2017 12:00:00 AM
322527 340560164 1 BROOKLYN 1559 BUSHWICK AVE 3471 9 3.08049e+06 A2 D ... MELENDEZ 74 74 GRANT AVE QUEENS NY 11208 7188097151 TO INSTALL SPRINKLERS IN THE NEW RENOVATED BUI... 11/06/2017 12:00:00 AM
322528 140713949 1 MANHATTAN 345 5TH AVE 863 4 1.01708e+06 A2 H ... OHEBSHALOM REGENCY ENTERPRISES PARTNERS LLC 347 FIFTH AVENUE NEW YORK NY 10016 2126865252 FILE STRUCTURAL WORK TYPE AS PER PLAN. NO CHAN... 11/06/2017 12:00:00 AM
322529 540138331 1 STATEN ISLAND 62 MILBANK ROAD 4092 72 5.05711e+06 A2 H ... GRASSO NONE 62 MILBANK ROAD STATEN ISLAND NY 10306 3476238868 PROPOSED SPRINKLER SYSTEM IS DESIGNED IN ACCOR... 11/06/2017 12:00:00 AM

5 rows × 82 columns

There are other ways to handle null value.

In [19]:
# copy before testing
df6 = df2.copy()
# replace null value with the median value 
df6 = df6.fillna({'Age': 28})
df6.isnull().sum()
Out[19]:
Job #                            0
Doc #                            0
Borough                          0
House #                          0
Street Name                      0
Block                            0
Lot                              1
Bin #                            1
Job Type                         1
Job Status                       1
Job Status Descrp                1
Latest Action Date               1
Building Type                    1
Community - Board              186
Cluster                     322529
Landmarked                  277659
Adult Estab                 322511
Loft Board                  321153
City Owned                  283355
Little e                    319543
PC Filed                    322530
eFiling Filed                 2727
Plumbing                    177664
Mechanical                  211450
Boiler                      248965
Fuel Burning                252578
Fuel Storage                253139
Standpipe                   251930
Sprinkler                   235252
Fire Alarm                  240515
                             ...  
Vertical Enlrgmt            318326
Enlargement SQ Footage           2
Street Frontage                  2
ExistingNo. of Stories           2
Proposed No. of Stories          2
Existing Height                  2
Proposed Height                  2
Existing Dwelling Units     113673
Proposed Dwelling Units     112054
Existing Occupancy           47698
Proposed Occupancy           43975
Site Fill                   105712
Zoning Dist1                 40854
Zoning Dist2                282263
Zoning Dist3                320159
Special District 1          251875
Special District 2          301345
Owner Type                  322530
Non-Profit                  304277
Owner's First Name              10
Owner's Last Name               10
Owner's Business Name        41064
Owner's House Number            10
Owner'sHouse Street Name        18
City                             2
State                            3
Zip                              3
Owner'sPhone #                   4
Job Description               3571
DOBRunDate                      19
Length: 82, dtype: int64
In [20]:
# replace null value with the most popular value in a categorial columns
df6 = df6.fillna({'Embarked': 'S'})
df6.isnull().sum()
Out[20]:
Job #                            0
Doc #                            0
Borough                          0
House #                          0
Street Name                      0
Block                            0
Lot                              1
Bin #                            1
Job Type                         1
Job Status                       1
Job Status Descrp                1
Latest Action Date               1
Building Type                    1
Community - Board              186
Cluster                     322529
Landmarked                  277659
Adult Estab                 322511
Loft Board                  321153
City Owned                  283355
Little e                    319543
PC Filed                    322530
eFiling Filed                 2727
Plumbing                    177664
Mechanical                  211450
Boiler                      248965
Fuel Burning                252578
Fuel Storage                253139
Standpipe                   251930
Sprinkler                   235252
Fire Alarm                  240515
                             ...  
Vertical Enlrgmt            318326
Enlargement SQ Footage           2
Street Frontage                  2
ExistingNo. of Stories           2
Proposed No. of Stories          2
Existing Height                  2
Proposed Height                  2
Existing Dwelling Units     113673
Proposed Dwelling Units     112054
Existing Occupancy           47698
Proposed Occupancy           43975
Site Fill                   105712
Zoning Dist1                 40854
Zoning Dist2                282263
Zoning Dist3                320159
Special District 1          251875
Special District 2          301345
Owner Type                  322530
Non-Profit                  304277
Owner's First Name              10
Owner's Last Name               10
Owner's Business Name        41064
Owner's House Number            10
Owner'sHouse Street Name        18
City                             2
State                            3
Zip                              3
Owner'sPhone #                   4
Job Description               3571
DOBRunDate                      19
Length: 82, dtype: int64
In [37]:
# hello in Owner's Business name is inaccurate ...replacing inaccurate value with null value
df6 = df6.replace('None', np.nan)
df6.head()
Out[37]:
Job # Doc # Borough House # Street Name Block Lot Bin # Job Type Job Status ... Owner's Last Name Owner's Business Name Owner's House Number Owner'sHouse Street Name City State Zip Owner'sPhone # Job Description DOBRunDate
0 121577873 2 MANHATTAN 386 PARK AVENUE SOUTH 857 38.0 1016890.0 A2 D ... MIGLIORE MACKLOWE MANAGEMENT 126 EAST 56TH STREET NEW YORK NY 10222 2125545837 GENERAL MECHANICAL & PLUMBING MODIFICATIONS AS... 04/26/2013 12:00:00 AM
1 520129502 1 STATEN ISLAND 107 KNOX PLACE 342 1.0 5161350.0 A3 A ... BLUMENBERG NA 107 KNOX PLACE STATEN ISLAND NY 10314 3477398892 BUILDERS PAVEMENT PLAN 143 LF. ... 04/26/2013 12:00:00 AM
2 121601560 1 MANHATTAN 63 WEST 131 STREET 1729 9.0 1053831.0 A2 Q ... MARKOWITZ 635 RIVERSIDE DRIVE NY LLC 619 WEST 54TH STREET NEW YORK NY 10016 2127652555 GENERAL CONSTRUCTION TO INCLUDE NEW PARTITIONS... 04/26/2013 12:00:00 AM
3 121601203 1 MANHATTAN 48 WEST 25TH STREET 826 69.0 1015610.0 A2 D ... CASALE 48 W 25 ST LLC C/O BERNSTEIN 150 WEST 30TH STREET NEW YORK NY 10001 2125941414 STRUCTURAL CHANGES ON THE 5TH FLOOR (MOONDOG E... 04/26/2013 12:00:00 AM
4 121601338 1 MANHATTAN 45 WEST 29 STREET 831 7.0 1015754.0 A3 D ... LEE HYUNG-HYANG REALTY CORP 614 8 AVENUE NEW YORK NY 10001 2019881222 FILING HEREWITH FACADE REPAIR PLANS. WORK SCOP... 04/26/2013 12:00:00 AM

5 rows × 82 columns

In [36]:
# 350 is an inaccurate value for age and should be replaced by 35
# changinge single value with replace
df6 = df6.replace({"Owner's Business Name": 'NA'}, {"Owner's Business Name": 'No Name Given'})
df6.head()
Out[36]:
Job # Doc # Borough House # Street Name Block Lot Bin # Job Type Job Status ... Owner's Last Name Owner's Business Name Owner's House Number Owner'sHouse Street Name City State Zip Owner'sPhone # Job Description DOBRunDate
0 121577873 2 MANHATTAN 386 PARK AVENUE SOUTH 857 38.0 1016890.0 A2 D ... MIGLIORE MACKLOWE MANAGEMENT 126 EAST 56TH STREET NEW YORK NY 10222 2125545837 GENERAL MECHANICAL & PLUMBING MODIFICATIONS AS... 04/26/2013 12:00:00 AM
1 520129502 1 STATEN ISLAND 107 KNOX PLACE 342 1.0 5161350.0 A3 A ... BLUMENBERG NA 107 KNOX PLACE STATEN ISLAND NY 10314 3477398892 BUILDERS PAVEMENT PLAN 143 LF. ... 04/26/2013 12:00:00 AM
2 121601560 1 MANHATTAN 63 WEST 131 STREET 1729 9.0 1053831.0 A2 Q ... MARKOWITZ 635 RIVERSIDE DRIVE NY LLC 619 WEST 54TH STREET NEW YORK NY 10016 2127652555 GENERAL CONSTRUCTION TO INCLUDE NEW PARTITIONS... 04/26/2013 12:00:00 AM
3 121601203 1 MANHATTAN 48 WEST 25TH STREET 826 69.0 1015610.0 A2 D ... CASALE 48 W 25 ST LLC C/O BERNSTEIN 150 WEST 30TH STREET NEW YORK NY 10001 2125941414 STRUCTURAL CHANGES ON THE 5TH FLOOR (MOONDOG E... 04/26/2013 12:00:00 AM
4 121601338 1 MANHATTAN 45 WEST 29 STREET 831 7.0 1015754.0 A3 D ... LEE HYUNG-HYANG REALTY CORP 614 8 AVENUE NEW YORK NY 10001 2019881222 FILING HEREWITH FACADE REPAIR PLANS. WORK SCOP... 04/26/2013 12:00:00 AM

5 rows × 82 columns

What if the data that is bad (or missing) is a time or date stamp?

You can replace the bad data, if you know the column, by pandas object.

In [23]:
# pandas object
df7 = df2.copy()
# for one column
df7['Pre- Filing Date'] = pd.Timestamp('20130425')
df7.tail()
Out[23]:
Job # Doc # Borough House # Street Name Block Lot Bin # Job Type Job Status ... Owner's Last Name Owner's Business Name Owner's House Number Owner'sHouse Street Name City State Zip Owner'sPhone # Job Description DOBRunDate
322525 140713958 1 MANHATTAN 345 5TH AVENUE 863 4.0 1017075.0 A2 H ... OHEBSHALOM REGENCY ENTERPRISES PARTNERS LLC 347 FIFTH AVENUE NEW YORK NY 10016 2126865252 FILE FOR MECHANICAL AND SPRINKLER WORK AS PER ... 11/06/2017 12:00:00 AM
322526 140713976 1 MANHATTAN 517 COLUMBUS AVENUE 1198 7502.0 1031339.0 A2 H ... KONTOGIANNIS EGS, LLC 517 COLUMBUS AVENUE NEW YORK NY 10024 9177692434 INTERIOR ALTERATION OF EXISTING EATING AND DRI... 11/06/2017 12:00:00 AM
322527 340560164 1 BROOKLYN 1559 BUSHWICK AVE 3471 9.0 3080494.0 A2 D ... MELENDEZ NaN 74 GRANT AVE QUEENS NY 11208 7188097151 TO INSTALL SPRINKLERS IN THE NEW RENOVATED BUI... 11/06/2017 12:00:00 AM
322528 140713949 1 MANHATTAN 345 5TH AVE 863 4.0 1017075.0 A2 H ... OHEBSHALOM REGENCY ENTERPRISES PARTNERS LLC 347 FIFTH AVENUE NEW YORK NY 10016 2126865252 FILE STRUCTURAL WORK TYPE AS PER PLAN. NO CHAN... 11/06/2017 12:00:00 AM
322529 540138331 1 STATEN ISLAND 62 MILBANK ROAD 4092 72.0 5057106.0 A2 H ... GRASSO NONE 62 MILBANK ROAD STATEN ISLAND NY 10306 3476238868 PROPOSED SPRINKLER SYSTEM IS DESIGNED IN ACCOR... 11/06/2017 12:00:00 AM

5 rows × 82 columns

In [24]:
# for multiple columns
df7['Paid'] = pd.Timestamp('20130425')
df7['Fully Paid'] = pd.Timestamp('20130425')
df7['Assigned'] = pd.Timestamp('20130425')
df7['Approved'] = pd.Timestamp('20130425')
df7['Fully Permitted'] = pd.Timestamp('20130425')

This would be handy if you wanted to replace the data with known data, just so there's something listed.

What we really need to do is determine how many columns are data needed for real analysis.

For example, in this dataset, "Job Status Descrp", isn't necessary. As we learned earlier, just drop that column.
In [25]:
# drop all columns that contain missing or null data.
df8 = df2.copy()
df8 = df8.dropna(axis=1)
df8.head()
Out[25]:
Job # Doc # Borough House # Street Name Block
0 121577873 2 MANHATTAN 386 PARK AVENUE SOUTH 857
1 520129502 1 STATEN ISLAND 107 KNOX PLACE 342
2 121601560 1 MANHATTAN 63 WEST 131 STREET 1729
3 121601203 1 MANHATTAN 48 WEST 25TH STREET 826
4 121601338 1 MANHATTAN 45 WEST 29 STREET 831
In [26]:
# last 5 rows
df8.tail()
Out[26]:
Job # Doc # Borough House # Street Name Block
322525 140713958 1 MANHATTAN 345 5TH AVENUE 863
322526 140713976 1 MANHATTAN 517 COLUMBUS AVENUE 1198
322527 340560164 1 BROOKLYN 1559 BUSHWICK AVE 3471
322528 140713949 1 MANHATTAN 345 5TH AVE 863
322529 540138331 1 STATEN ISLAND 62 MILBANK ROAD 4092

So what data is left after dropping all rows with missing data?

In [27]:
# data info
df8.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 322530 entries, 0 to 322529
Data columns (total 6 columns):
Job #          322530 non-null int64
Doc #          322530 non-null int64
Borough        322530 non-null object
House #        322530 non-null object
Street Name    322530 non-null object
Block          322530 non-null int64
dtypes: int64(3), object(3)
memory usage: 14.8+ MB

We have dropped from 322,530 rows with 82 columns, down to just 6 columns!

This is not much data to work with now.

We can also use visualization techniques to find the bad data.

First, set up the X and Y variables.

In [28]:
df9 = pd.DataFrame(df2)
df9['x'] = df9['Existing Zoning Sqft']
df9['y'] = df9['Proposed Zoning Sqft']
In [29]:
# Scatter plot
sns.lmplot('x', 'y', data=df9, fit_reg=False)
Out[29]:
<seaborn.axisgrid.FacetGrid at 0x21a16630>
In [30]:
# Box plot
sns.boxplot([df9.x, df9.y])
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x53b5aa20>

Any bad value will be outside of the box.

In [31]:
# Density plot
sns.kdeplot(df9['x'])
C:\Users\Gene\Anaconda2\lib\site-packages\statsmodels\nonparametric\kde.py:454: RuntimeWarning:

invalid value encountered in greater

C:\Users\Gene\Anaconda2\lib\site-packages\statsmodels\nonparametric\kde.py:454: RuntimeWarning:

invalid value encountered in less

Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x566a91d0>

Here it would be the value that is "out of scope" compared to all other values.

Using plotly to visualize csv.

In [32]:
# iplot "scatterplot" for visualization.
dt = [Scatter(x=df9.x, y=df9.y, mode = 'markers')]
iplot(dt)

Pivot tables can also be used to see your data.

In [33]:
# create a pivot, which helps you develop insights about the data in question.
df10 = df8.copy()
df10.groupby('Job #').count()
Out[33]:
Doc # Borough House # Street Name Block
Job #
102416798 2 2 2 2 2
102416805 1 1 1 1 1
102416841 1 1 1 1 1
103643844 2 2 2 2 2
103643853 1 1 1 1 1
103643862 1 1 1 1 1
103643871 1 1 1 1 1
103643880 1 1 1 1 1
103643899 1 1 1 1 1
103643906 3 3 3 3 3
103643915 1 1 1 1 1
103643924 2 2 2 2 2
103643933 1 1 1 1 1
103643942 1 1 1 1 1
103643951 1 1 1 1 1
103643960 4 4 4 4 4
103643979 3 3 3 3 3
103643988 1 1 1 1 1
103643997 2 2 2 2 2
103644004 1 1 1 1 1
103644013 2 2 2 2 2
103644022 1 1 1 1 1
103644031 3 3 3 3 3
103644040 1 1 1 1 1
103644059 1 1 1 1 1
103644068 2 2 2 2 2
103644077 1 1 1 1 1
103644086 1 1 1 1 1
103644095 1 1 1 1 1
103644102 1 1 1 1 1
... ... ... ... ... ...
540137822 1 1 1 1 1
540137831 1 1 1 1 1
540137920 1 1 1 1 1
540137948 1 1 1 1 1
540137957 1 1 1 1 1
540137966 1 1 1 1 1
540137975 1 1 1 1 1
540137984 1 1 1 1 1
540137993 1 1 1 1 1
540138000 1 1 1 1 1
540138019 1 1 1 1 1
540138037 1 1 1 1 1
540138126 1 1 1 1 1
540138135 1 1 1 1 1
540138144 1 1 1 1 1
540138153 1 1 1 1 1
540138162 1 1 1 1 1
540138171 1 1 1 1 1
540138180 1 1 1 1 1
540138199 1 1 1 1 1
540138206 1 1 1 1 1
540138224 1 1 1 1 1
540138233 1 1 1 1 1
540138251 1 1 1 1 1
540138260 1 1 1 1 1
540138279 1 1 1 1 1
540138297 1 1 1 1 1
540138304 1 1 1 1 1
540138313 1 1 1 1 1
540138331 1 1 1 1 1

289582 rows × 5 columns

Once you have cleaned your data, you can save it back again.

In [34]:
# save back to csv file (this example is only partially clean; you want to make sure you have used all code to clean the data first)
df10.to_csv("data/cleaned_data.csv", encoding="utf-8")

Hopefully, this tutorial has helped you learn some basic data cleaning.